ADM: Market Basket Analysis/Association Rules

Why is Association Analysis (AA) important in data mining?

  1. Market basket data analysis, cross-marketing, catalog design, sales campaign analysis

  2. Web log (click stream) analysis, DNA sequence analysis, etc.

Image Source:

  1. https://www.liputan6.com/tekno/read/2586238/pasar-online-indonesia-kian-tumbuh-ecommerce-berjaya
  2. https://ginbusiness.wordpress.com/2016/02/27/jenis-e-commerce-di-indonesia/

Image Source: https://www.semanticscholar.org/paper/Collaborative-Filtering-and-Artificial-Neural-Based-Mylavarapu/a2a9f263172691fc66a4c8bc1d44dd4df77c347c

Image Source: https://www.themarketingtechnologist.co/building-a-recommendation-engine-for-geek-setting-up-the-prerequisites-13/

Association Analysis (AA)

  • Looks for relationships (links) between variables across the set of records in the data.
  • These links are called associations.
  • Three types of association problems:

  1. Association discovery (unordered; the focus of this lecture)
  2. Sequential pattern discovery (ordered; not covered)
  3. Similar time discovery (uses time information, e.g. log analysis)

Association Rules ~ Market Basket Analysis

image Source: https://www.kdnuggets.com/2018/07/minimum-viable-data-product.html

image source: https://www.youtube.com/watch?v=VZL6uhA8XKg

Association Rules (AR) in one paragraph

AR tries to find all sets of items (itemsets) whose support is greater than a minimum support, then uses those frequent itemsets to generate rules whose confidence is greater than a minimum confidence. The rules are then judged as valuable (interesting) based on their lift. The most popular application of AR is Market Basket Analysis (MBA).

Items and Itemsets

  • AR data comes as "transactions": a collection of itemsets, each of whose elements is an item
  • Items: Bread, Milk, Coke, etc.
  • Itemset: {Bread, Milk}
  • Example transactions from one day at a store:

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Formally (Summary of AR Theory)

  • An item is an element of the data, for example: Milk, Bread, Eggs
  • An itemset is any subset formed from the items, for example: {Milk, Bread, Eggs} or {Milk, Eggs}.
  • The frequency with which an item or itemset X appears in the data is called its support, s(X): the fraction of transactions that contain X.
  • If the support exceeds a threshold, the itemset is called a frequent itemset.
  • A rule has the form X ⇒ Y, where X (the antecedent) and Y (the consequent) are itemsets. Example:
  • {Milk, Diaper} ⇒ {Beer}
  • The support of a rule is the fraction of transactions that contain both X and Y:
  • s(X ⇒ Y) = s(X ∪ Y)
  • In association rule mining, we look for rules with significant support and confidence.
  • The ratio of a rule's confidence to its expected (unconditional) confidence is called "lift":
  • Lift < 1: "negative" (less than expected)
    Lift = 1: neutral
    Lift > 1: "positive" (more than expected)
  • ["lift"] S. Brin, R. Motwani, J. D. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data

Example Rule:

Instant Noodles ==> Chili Sauce

Marketing uses rules to make a variety of decisions, for example:

  • Place the two items close together (so shoppers do not forget to buy both).
  • Place the two items far apart (so customers browse other items along the way).
  • Bundle the two items in one promotion (the promo is more attractive because customers need both).
  • Bundle the two items with another, slower-selling item (cross-selling).
  • Raise the price of one item and lower the price of the other (a technique for competing with "the shop next door").
  • Do not advertise the two items together.
  • Offer a free sachet of chili sauce with every purchase of premium instant noodles.

Rule, Support, Confidence, Lift by Example

Image Source: http://www.saedsayad.com/association_rules.htm

Support

The support of rule A=>B is the probability that A and B appear together: $$ Support(A=>B) = \frac{|A \cap B|}{|T|} $$ where $|A\cap B|$ is the number of transactions that contain both product A and product B, and $|T|$ is the total number of transactions.
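As a minimal sketch of the support formula, the snippet below counts support on the five sample transactions from the "Items and Itemsets" table. The function name `support` and the set-based representation are illustrative choices, not part of the original notebook.

```python
# Sample transactions from the table earlier in these notes (TID 1..5).
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

# The support of the rule {Milk, Diaper} => {Beer} is the support of the union.
rule_support = support({"Milk", "Diaper"} | {"Beer"}, transactions)
print(rule_support)  # 0.4 (transactions 3 and 4 out of 5)
```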

Confidence

The confidence of rule A=>B is the conditional probability of B given A; in AR it is computed as: $$ Confidence(A=>B) = \frac{|A\cap B|}{|A|}$$ where $|A|$ is the number of transactions that contain A.
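The following sketch estimates confidence from transaction counts on the same five sample transactions, dividing by the count of the antecedent. The helper names are hypothetical, and the code assumes the antecedent appears in at least one transaction (otherwise the division fails).

```python
# Sample transactions from the table earlier in these notes (TID 1..5).
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset, transactions):
    """Number of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)

def confidence(antecedent, consequent, transactions):
    """Estimate P(consequent | antecedent) from transaction counts."""
    both = support_count(set(antecedent) | set(consequent), transactions)
    return both / support_count(antecedent, transactions)

print(confidence({"Milk", "Diaper"}, {"Beer"}, transactions))  # 2 of 3 transactions
# Confidence is not symmetric: swapping the two sides changes the denominator.
print(confidence({"Eggs"}, {"Bread"}, transactions))  # 1.0
print(confidence({"Bread"}, {"Eggs"}, transactions))  # 0.25
```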

Lift

The lift of rule A=>B measures how much more often A and B appear together than they would if they were statistically independent. If A and B are independent then Lift(A=>B)=1; if the lift is greater than 1, A and B are said to be positively correlated, and if it is less than 1, negatively correlated. Lift(A=>B) is computed as: $$ lift(A=>B)=\frac{confidence(A=>B)}{P(B)}=\frac{P(A\cap B)}{P(A)P(B)}$$ Note that lift is symmetric: Lift(A=>B) = Lift(B=>A).
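A small sketch of the lift formula on the five sample transactions, including a check of the symmetry property. The function names are illustrative, and the code assumes both sides of the rule have non-zero support.

```python
import math

# Sample transactions from the table earlier in these notes (TID 1..5).
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support(itemset, transactions):
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def lift(antecedent, consequent, transactions):
    """P(A and B) / (P(A) * P(B)), estimated from the transactions."""
    joint = support(set(antecedent) | set(consequent), transactions)
    return joint / (support(antecedent, transactions) * support(consequent, transactions))

lift_md_beer = lift({"Milk", "Diaper"}, {"Beer"}, transactions)
lift_beer_md = lift({"Beer"}, {"Milk", "Diaper"}, transactions)
print(round(lift_md_beer, 3))                    # 1.111, slightly positive
print(math.isclose(lift_md_beer, lift_beer_md))  # True: lift is symmetric
print(round(lift({"Milk"}, {"Beer"}, transactions), 3))  # 0.833, below 1
```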

Leverage

Leverage is similar to lift, except that it measures the difference (rather than the ratio, as lift does) between the frequency with which A and B appear together and the frequency expected if they were independent. A leverage of 0 indicates that A and B are independent. Leverage is computed as: $$ Leverage(A=>B)= Support(A=>B) - Support(A) \times Support(B)$$

Reference for leverage: Piatetsky-Shapiro, G., Discovery, analysis, and presentation of strong rules. Knowledge Discovery in Databases, 1991: p. 229-248.
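To make the difference-versus-ratio distinction concrete, here is a hedged sketch that computes leverage on the five sample transactions. The function names are hypothetical; only the formula comes from the notes.

```python
# Sample transactions from the table earlier in these notes (TID 1..5).
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support(itemset, transactions):
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def leverage(antecedent, consequent, transactions):
    """Observed joint support minus the joint support expected under independence."""
    joint = support(set(antecedent) | set(consequent), transactions)
    expected = support(antecedent, transactions) * support(consequent, transactions)
    return joint - expected

lev_positive = leverage({"Milk", "Diaper"}, {"Beer"}, transactions)
lev_negative = leverage({"Milk"}, {"Beer"}, transactions)
print(round(lev_positive, 2))  # 0.04  (0.4 observed vs 0.36 expected)
print(round(lev_negative, 2))  # -0.08 (0.4 observed vs 0.48 expected)
```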

Summary

All of the measures above are neatly summarized as follows:

image source: http://rasbt.github.io/mlxtend/user_guide/frequent_patterns/association_rules/

Rule Selection

In practice, AR produces a very large number of rules from the data, and not all of them will be used for decision making. To reduce the number of rules, similar items should first be grouped into categories. Rules that meet the following criteria are then selected (why? Please discuss in the Forum):

  • Rules with large and small lift.
  • Items that appear most (and least) often.
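The sketch below shows one possible way to generate rules by brute force from the five sample transactions and then pick the rules with the largest and smallest lift. The thresholds `MIN_SUPPORT` and `MIN_CONFIDENCE` are illustrative values chosen for this toy data, not values from the notes, and brute-force enumeration is only practical for a handful of items.

```python
from itertools import combinations

# Sample transactions from the table earlier in these notes (TID 1..5).
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
MIN_SUPPORT = 0.4     # assumed threshold for this toy example
MIN_CONFIDENCE = 0.6  # assumed threshold for this toy example

def support(itemset):
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

items = sorted(set().union(*transactions))
rules = []  # (antecedent, consequent, support, confidence, lift)
for size in range(2, len(items) + 1):
    for combo in combinations(items, size):
        itemset = frozenset(combo)
        s = support(itemset)
        if s < MIN_SUPPORT:
            continue
        # Split each frequent itemset into every antecedent/consequent pair.
        for k in range(1, size):
            for left in combinations(combo, k):
                antecedent = frozenset(left)
                consequent = itemset - antecedent
                conf = s / support(antecedent)
                if conf >= MIN_CONFIDENCE:
                    lift = conf / support(consequent)
                    rules.append((sorted(antecedent), sorted(consequent), s, conf, lift))

# Rank by lift and look at both ends of the ranking.
rules.sort(key=lambda r: r[4], reverse=True)
for antecedent, consequent, s, conf, lift in rules[:3] + rules[-3:]:
    print(f"{antecedent} => {consequent}: support={s:.2f}, confidence={conf:.2f}, lift={lift:.2f}")
```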

The Apriori Principle (Anti-monotone Property)

If an itemset is frequent, then all of its subsets must also be frequent. Equivalently (the contrapositive), if an itemset is infrequent, then all of its supersets must also be infrequent. Formally: $$ \forall A, B : (A\subseteq B) \Rightarrow s(A) \geq s(B) $$ In other words, the support of an itemset never exceeds the support of any of its subsets. This property becomes very important later for reducing the computational cost of deriving rules from the data.
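The anti-monotone property can be checked empirically. The sketch below, which assumes the five sample transactions from the "Items and Itemsets" table, compares the support of every 2- and 3-itemset against the support of each of its proper subsets and counts violations.

```python
from itertools import combinations

# Sample transactions from the table earlier in these notes (TID 1..5).
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support(itemset):
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

items = sorted(set().union(*transactions))
violations = 0
for size in (2, 3):
    for combo in combinations(items, size):
        s_superset = support(set(combo))
        for k in range(1, size):
            for sub in combinations(combo, k):
                # A subset must appear at least as often as its superset.
                if support(set(sub)) < s_superset:
                    violations += 1

print(violations)  # 0: no subset is rarer than its superset
# Pruning consequence: {Eggs} has support 0.2, so every itemset containing
# Eggs has support at most 0.2 and never needs to be counted.
print(support({"Eggs"}), support({"Eggs", "Bread"}))
```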

Association Rule Algorithms:

Although the theory of AR is fairly simple, there are quite a few AR algorithms, among them AIS, Apriori, SETM, AprioriTid, and AprioriHybrid. Most of them differ in how they try to reduce computation. Here we only briefly discuss the AIS and Apriori algorithms.

The AIS Algorithm:

  • Candidate itemsets are generated and counted on the fly as the data is scanned.
  • For each transaction, the algorithm determines which of the large (frequent) itemsets found so far are contained in that transaction.
  • New candidate itemsets are generated by extending these large itemsets with other items in the transaction.
  • The figure below shows the process in more detail:
  • The drawback of AIS is that it generates too many candidate itemsets that turn out to be small (infrequent).

Illustration of the AIS algorithm and the use of minimum support to reduce computation.
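The steps above can be sketched roughly as follows. This is a simplified reading of AIS, not a faithful reimplementation of the original paper: it assumes items are ordered alphabetically and that a large itemset is only extended with items that come after its last item, so each candidate is generated once per transaction. The function name `ais` and the `min_count` parameter are illustrative.

```python
from collections import Counter

# Sample transactions from the table earlier in these notes (TID 1..5).
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def ais(transactions, min_count):
    """Simplified AIS: returns (frequent itemsets with counts, candidates counted)."""
    # Pass 1: count every single item.
    counts = Counter(frozenset([item]) for t in transactions for item in t)
    n_candidates = len(counts)
    large = {s: c for s, c in counts.items() if c >= min_count}
    frequent = dict(large)
    while large:
        counts = Counter()
        for t in transactions:
            for itemset in large:
                if itemset <= t:
                    # Extend with items of this transaction that sort after
                    # the itemset's last item; candidates come from the data.
                    last = max(itemset)
                    for item in t:
                        if item > last:
                            counts[itemset | {item}] += 1
        n_candidates += len(counts)
        large = {s: c for s, c in counts.items() if c >= min_count}
        frequent.update(large)
    return frequent, n_candidates

frequent, n_candidates = ais(transactions, min_count=2)
print(len(frequent), "frequent itemsets out of", n_candidates, "candidates counted")
```

Comparing the two printed numbers gives a feel for the drawback in the last bullet: many of the counted candidates do not reach the minimum support.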

The Apriori Algorithm

  1. Candidate itemsets are generated using only the large itemsets of the previous pass without considering the transactions in the database.
  2. The large itemset of the previous pass is joined with itself to generate all itemsets whose size is higher by 1.
  3. Each generated itemset that has a subset which is not large is deleted. The remaining itemsets are the candidate ones.
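The three steps above can be sketched as a join followed by a prune. This is a minimal illustration on the five sample transactions, not the optimized procedure from the original paper: the join here simply unions pairs of large itemsets rather than matching sorted prefixes, which produces the same candidates after pruning. The names `apriori_gen` and `apriori_sketch` are illustrative.

```python
from itertools import combinations

# Sample transactions from the table earlier in these notes (TID 1..5).
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def apriori_gen(prev_large, k):
    """Join L(k-1) with itself, then drop candidates with a non-large (k-1)-subset."""
    prev = set(prev_large)
    joined = {a | b for a in prev for b in prev if len(a | b) == k}
    return {c for c in joined
            if all(frozenset(sub) in prev for sub in combinations(c, k - 1))}

def apriori_sketch(transactions, min_count):
    """Return every itemset contained in at least `min_count` transactions."""
    transactions = [frozenset(t) for t in transactions]
    candidates = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while candidates:
        # The database is only scanned to count candidates, not to create them.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        large = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(large)
        k += 1
        candidates = apriori_gen(large.keys(), k)
    return frequent

frequent = apriori_sketch(transactions, min_count=2)
large_pairs = {s for s in frequent if len(s) == 2}
candidate_triples = apriori_gen(large_pairs, 3)
print(len(frequent))           # 17 frequent itemsets
print(len(candidate_triples))  # 5 candidate 3-itemsets survive the prune
```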

Image Source: http://www.saedsayad.com/association_rules.htm

Other Algorithms:

  • SETM Algorithm
  • AprioriTid Algorithm
  • AprioriHybrid Algorithm
  • etc.

Discussion:

  • A store carries too many kinds of items ==> how do we deal with it?
  • Is AR inferential? How often should the rules be regenerated?

References:

[1]: J. Han, J. Pei, Y. Yin, R. Mao. Mining Frequent Patterns without Candidate Generation: A Frequent-Pattern Tree Approach. 2004. https://www.cs.sfu.ca/~jpei/publications/dami03_fpgrowth.pdf
[2]: R. Agarwal, C. Aggarwal, V. Prasad. Depth first generation of long patterns. 2000. http://www.cs.tau.ac.il/~fiat/dmsem03/Depth%20First%20Generation%20of%20Long%20Patterns%20-%202000.pdf
[3]: R. Agrawal, et al. Fast Discovery of Association Rules. 1996. http://cs-people.bu.edu/evimaria/cs565/advances.pdf

Some Recommendation-Model Modules in Python:

  • Crab (discontinued).
  • Surprise
  • Python Recsys (discontinued/very old)
  • MRec (discontinued)
  • mlxtend (very limited documentation)
  • PyCaret: https://pycaret.readthedocs.io/en/latest/api/arules.html

We will also try Orange: a Python GUI for data mining.

Image Source: http://gp.mx.tl/oranged-net-software

Add-ons in Orange

  • Installing the "Associate" add-on in Orange
  • pip install orange3
  • python -m Orange.canvas (Create shortcut with this command)
  • Install the add-on
  • http://orange3-associate.readthedocs.org/

CSV/XLS(X) data files in Orange (three-header format)

[1]. Feature names (variable names) on the first line.

[2]. Feature types on the second line. The type is determined automatically, or, if set, can be any of the following:

  • discrete (or d): imported as Orange.data.DiscreteVariable,
  • a space-separated list of discrete values, like "male female", which will result in Orange.data.DiscreteVariable with those values and in that order. If the individual values contain a space character, it needs to be escaped (prefixed) with, as common, a backslash ('\') character.
  • continuous (or c): imported as Orange.data.ContinuousVariable,
  • string (or s, or text): imported as Orange.data.StringVariable,
  • time (or t): imported as Orange.data.TimeVariable, if the values parse as ISO 8601 date/time formats,
  • basket: used for storing sparse data. More on basket formats in a dedicated section.

[3]. Flags (optional) on the third header line. A feature's flag can be empty, or it can contain, space-separated, a consistent combination of:

  • class (or c): feature will be imported as a class variable. Most algorithms expect a single class variable.
  • meta (or m): feature will be imported as a meta-attribute, just describing the data instance but not actually used for learning,
  • weight (or w): the feature marks the weight of examples (in algorithms that support weighted examples),
  • ignore (or i): feature will not be imported,
  • <key>=<value> custom attributes.
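As an illustration of the three header lines, the sketch below writes a tiny CSV with Python's standard `csv` module. The column names and data rows (`gender`, `age`, `name`, `outcome`, and the values under them) are made up for the example; only the meaning of the three header rows comes from the description above.

```python
import csv
import io

rows = [
    ["gender", "age", "name", "outcome"],              # line 1: feature names
    ["discrete", "continuous", "string", "discrete"],  # line 2: feature types
    ["", "", "meta", "class"],                         # line 3: flags (may be empty)
    ["male", "34", "Andi", "yes"],                     # data rows (hypothetical)
    ["female", "29", "Sari", "no"],
]

# Write to an in-memory buffer; pass a real file object to create a .csv file.
buffer = io.StringIO()
csv.writer(buffer, lineterminator="\n").writerows(rows)
print(buffer.getvalue())
```

Here `name` is flagged as a meta-attribute and `outcome` as the class variable, while the first two columns are ordinary features.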

Example in Orange

  • Example taken from https://blog.biolab.si/2016/04/25/association-rules-in-orange/
  • Input data: "FoodMart 2000 Dataset"
  • Drag the "DataSet" node (Open) ==> "Send Data"
  • Drag the "Data" node ==> Open/Send Automatically
  • Drag the Frequent Itemsets node
  • Drag the Association Rules node

In [1]:
import warnings; warnings.simplefilter('ignore')

file_ = 'data/Online_Retail.csv'
try:
    # Importing google.colab only succeeds inside Google Colab
    import google.colab; IN_COLAB = True
    print("Installing the required modules")
    !pip install mlxtend
    !mkdir data
    !wget -P data/ https://raw.githubusercontent.com/taudataanalytics/Data-Mining--Penambangan-Data--Ganjil-2024/master/{file_}
except ImportError:
    IN_COLAB = False
    print("Running the code locally, please make sure all the python module versions agree with colab environment and all data/assets downloaded")
Running the code locally, please make sure all the python module versions agree with colab environment and all data/assets downloaded
In [2]:
import warnings; warnings.simplefilter('ignore')
import pandas as pd, matplotlib.pyplot as plt, seaborn as sns
from itertools import combinations
from collections import Counter
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules
#from pycaret.arules import *

%matplotlib inline
plt.style.use('bmh'); sns.set()
In [3]:
# In Python
T = [
 ('Bread', 'Milk'),
 ('Beer', 'Bread', 'Diaper', 'Eggs', 'Milk', 'Bread', 'Milk', 'Milk'),
 ('Beer', 'Coke', 'Diaper', 'Milk'),
 ('Beer', 'Bread', 'Diaper', 'Milk'),
 ('Bread', 'Coke', 'Diaper', 'Milk', 'Diaper'),
]
T
Out[3]:
[('Bread', 'Milk'),
 ('Beer', 'Bread', 'Diaper', 'Eggs', 'Milk', 'Bread', 'Milk', 'Milk'),
 ('Beer', 'Coke', 'Diaper', 'Milk'),
 ('Beer', 'Bread', 'Diaper', 'Milk'),
 ('Bread', 'Coke', 'Diaper', 'Milk', 'Diaper')]
In [4]:
# Calculating item sets
# A bit of discrete mathematics nostalgia :)
def subsets(S, k):
    return [set(s) for s in combinations(S, k)]

subsets({1, 2, 3, 7, 8}, 2)
Out[4]:
[{1, 2},
 {1, 3},
 {1, 7},
 {1, 8},
 {2, 3},
 {2, 7},
 {2, 8},
 {3, 7},
 {3, 8},
 {7, 8}]
In [5]:
# Counting item occurrences within one transaction (duplicates included); this is not yet support
Counter(T[1])
Out[5]:
Counter({'Milk': 3, 'Bread': 2, 'Beer': 1, 'Diaper': 1, 'Eggs': 1})
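The `Counter` call above counts repeated items inside a single transaction. For support, each transaction should contribute at most once per item. A minimal sketch, assuming the same list `T` as defined in the notebook, removes duplicates with `set` before counting across all transactions:

```python
from collections import Counter

# Same transactions as in the notebook, including the repeated items.
T = [
    ('Bread', 'Milk'),
    ('Beer', 'Bread', 'Diaper', 'Eggs', 'Milk', 'Bread', 'Milk', 'Milk'),
    ('Beer', 'Coke', 'Diaper', 'Milk'),
    ('Beer', 'Bread', 'Diaper', 'Milk'),
    ('Bread', 'Coke', 'Diaper', 'Milk', 'Diaper'),
]

# set(t) drops duplicates, so an item is counted once per transaction.
item_counts = Counter(item for t in T for item in set(t))
item_support = {item: count / len(T) for item, count in item_counts.items()}

for item, s in sorted(item_support.items(), key=lambda kv: -kv[1]):
    print(f"{item}: {s:.1f}")
```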
In [6]:
# Using Module
# Taken from https://pbpython.com/market-basket-analysis.html
# First, load the data
try:
    df = pd.read_csv('data/Online_Retail.csv')
except FileNotFoundError:
    # Fall back to the original Excel file from the UCI repository
    df = pd.read_excel('http://archive.ics.uci.edu/ml/machine-learning-databases/00352/Online%20Retail.xlsx')
df.head(10)
Out[6]:
     InvoiceNo  StockCode  Description                          Quantity  InvoiceDate          UnitPrice  CustomerID  Country
0    536365     85123A     white hanging heart t-light holder   6         2010-12-01 08:26:00  2.55       17850.0     United Kingdom
1    536365     71053      white metal lantern                  6         2010-12-01 08:26:00  3.39       17850.0     United Kingdom
2    536365     84406B     cream cupid hearts coat hanger       8         2010-12-01 08:26:00  2.75       17850.0     United Kingdom
3    536365     84029G     knitted union flag hot water bottle  6         2010-12-01 08:26:00  3.39       17850.0     United Kingdom
4    536365     84029E     red woolly hottie white heart.       6         2010-12-01 08:26:00  3.39       17850.0     United Kingdom
5    536365     22752      set 7 babushka nesting boxes         2         2010-12-01 08:26:00  7.65       17850.0     United Kingdom
6    536365     21730      glass star frosted t-light holder    6         2010-12-01 08:26:00  4.25       17850.0     United Kingdom
7    536366     22633      hand warmer union jack               6         2010-12-01 08:28:00  1.85       17850.0     United Kingdom
8    536366     22632      hand warmer red polka dot            6         2010-12-01 08:28:00  1.85       17850.0     United Kingdom
9    536367     84879      assorted colour bird ornament        32        2010-12-01 08:34:00  1.69       13047.0     United Kingdom
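In [7]:
# Preprocessing
df['Description'] = df['Description'].str.strip() # remove unnecessary spaces
df['Description'] = df['Description'].str.lower() # lower case normalization
df.dropna(axis=0, subset=['InvoiceNo'], inplace=True) # delete rows with no invoice no
df['InvoiceNo'] = df['InvoiceNo'].astype('str') # Change data type
# NOTE: str.contains is case-sensitive, so lowercase 'c' matches no invoice number;
# cancelled invoices (prefixed with 'C') stay in the data, as the later outputs show.
# Use 'C' here to actually drop them (the recorded outputs would then change).
df = df[~df['InvoiceNo'].str.contains('c')] # intended: remove invoices with C in them
df.head()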
Out[7]:
     InvoiceNo  StockCode  Description                          Quantity  InvoiceDate          UnitPrice  CustomerID  Country
0    536365     85123A     white hanging heart t-light holder   6         2010-12-01 08:26:00  2.55       17850.0     United Kingdom
1    536365     71053      white metal lantern                  6         2010-12-01 08:26:00  3.39       17850.0     United Kingdom
2    536365     84406B     cream cupid hearts coat hanger       8         2010-12-01 08:26:00  2.75       17850.0     United Kingdom
3    536365     84029G     knitted union flag hot water bottle  6         2010-12-01 08:26:00  3.39       17850.0     United Kingdom
4    536365     84029E     red woolly hottie white heart.       6         2010-12-01 08:26:00  3.39       17850.0     United Kingdom
In [8]:
df.to_csv("data/Online_Retail.csv", encoding='utf8', index=False)
'Done'
Out[8]:
'Done'
In [9]:
filter_ = {'pls', 'plas'}
for f in filter_:
    df = df[~df['InvoiceNo'].str.contains(f)] # filtering invoice
In [10]:
print(set(df['Country']))
{'Hong Kong', 'Unspecified', 'Australia', 'Belgium', 'Brazil', 'Norway', 'Sweden', 'Poland', 'France', 'European Community', 'Netherlands', 'Bahrain', 'Saudi Arabia', 'Japan', 'Lithuania', 'Greece', 'Austria', 'Denmark', 'Malta', 'Germany', 'Canada', 'Channel Islands', 'United Kingdom', 'EIRE', 'Lebanon', 'Portugal', 'Czech Republic', 'Italy', 'Finland', 'Cyprus', 'Switzerland', 'Iceland', 'Israel', 'USA', 'RSA', 'Spain', 'United Arab Emirates', 'Singapore'}
In [11]:
df_A = df[df['Country'] =="Australia"]
df_A.head()
Out[11]:
     InvoiceNo  StockCode  Description                          Quantity  InvoiceDate          UnitPrice  CustomerID  Country
197  536389     22941      christmas lights 10 reindeer         6         2010-12-01 10:03:00  8.50       12431.0     Australia
198  536389     21622      vintage union jack cushion cover     8         2010-12-01 10:03:00  4.95       12431.0     Australia
199  536389     21791      vintage heads and tails card game    12        2010-12-01 10:03:00  1.25       12431.0     Australia
200  536389     35004C     set of 3 coloured flying ducks       6         2010-12-01 10:03:00  5.45       12431.0     Australia
201  536389     35004G     set of 3 gold flying ducks           4         2010-12-01 10:03:00  6.35       12431.0     Australia
In [12]:
type(df_A)
Out[12]:
pandas.core.frame.DataFrame
In [13]:
# Let's sample the data
basket = df[df['Country'] =="Australia"]
basket.head()
Out[13]:
     InvoiceNo  StockCode  Description                          Quantity  InvoiceDate          UnitPrice  CustomerID  Country
197  536389     22941      christmas lights 10 reindeer         6         2010-12-01 10:03:00  8.50       12431.0     Australia
198  536389     21622      vintage union jack cushion cover     8         2010-12-01 10:03:00  4.95       12431.0     Australia
199  536389     21791      vintage heads and tails card game    12        2010-12-01 10:03:00  1.25       12431.0     Australia
200  536389     35004C     set of 3 coloured flying ducks       6         2010-12-01 10:03:00  5.45       12431.0     Australia
201  536389     35004G     set of 3 gold flying ducks           4         2010-12-01 10:03:00  6.35       12431.0     Australia
In [14]:
# Group the transaction
basket = basket.groupby(['InvoiceNo', 'Description'])['Quantity']
basket.head()
Out[14]:
197        6
198        8
199       12
200        6
201        4
          ..
497681    20
497682    24
497683    20
497684    12
497685    12
Name: Quantity, Length: 1259, dtype: int64
In [15]:
basket.sum().unstack()
Out[15]:
Description 10 colour spaceboy pen 12 pencil small tube woodland 12 pencils tall tube posy 12 pencils tall tube red retrospot 16 piece cutlery set pantry design 20 dolly pegs retrospot 3 hook hanger magic garden 3 stripey mice feltcraft 3 tier cake tin green and cream 3 tier cake tin red and cream ... wrap doiley design wrap dolly girl wrap english rose wrap i love london wrap poppies design wrap red apples wrap red vintage doily wrap vintage leaf design wrap wedding day yellow giant garden thermometer
InvoiceNo
536389 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
537676 NaN NaN NaN NaN NaN 24.0 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
539419 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
540267 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
540280 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
C560540 NaN NaN NaN NaN NaN NaN NaN -1.0 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN -1.0 NaN NaN
C561227 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
C568694 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
C574019 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
C574344 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

69 rows × 609 columns

In [16]:
# Sum the quantities, unstack, fill nulls with 0, and use the invoice number as the row index
basket = basket.sum().unstack().reset_index().fillna(0).set_index('InvoiceNo')
basket.head()
Out[16]:
Description 10 colour spaceboy pen 12 pencil small tube woodland 12 pencils tall tube posy 12 pencils tall tube red retrospot 16 piece cutlery set pantry design 20 dolly pegs retrospot 3 hook hanger magic garden 3 stripey mice feltcraft 3 tier cake tin green and cream 3 tier cake tin red and cream ... wrap doiley design wrap dolly girl wrap english rose wrap i love london wrap poppies design wrap red apples wrap red vintage doily wrap vintage leaf design wrap wedding day yellow giant garden thermometer
InvoiceNo
536389 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
537676 0.0 0.0 0.0 0.0 0.0 24.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
539419 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
540267 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
540280 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 609 columns

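In [17]:
def encode_units(x):
    # 1 if the item was bought in this invoice, otherwise 0 (also for negative quantities)
    return 1 if x >= 1 else 0

basket_sets = basket.applymap(encode_units) # binary (0/1) encoding; pandas >= 2.1 renames applymap to DataFrame.map
basket_sets.head()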
Out[17]:
Description 10 colour spaceboy pen 12 pencil small tube woodland 12 pencils tall tube posy 12 pencils tall tube red retrospot 16 piece cutlery set pantry design 20 dolly pegs retrospot 3 hook hanger magic garden 3 stripey mice feltcraft 3 tier cake tin green and cream 3 tier cake tin red and cream ... wrap doiley design wrap dolly girl wrap english rose wrap i love london wrap poppies design wrap red apples wrap red vintage doily wrap vintage leaf design wrap wedding day yellow giant garden thermometer
InvoiceNo
536389 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
537676 0 0 0 0 0 1 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
539419 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
540267 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
540280 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 609 columns
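To see what this invoice-by-item 0/1 table represents without the large retail dataset, here is a minimal plain-Python sketch on the five sample transactions from the "Items and Itemsets" table. The variable names are illustrative; the notebook itself builds the table with pandas.

```python
# Sample transactions from the table earlier in these notes (TID 1..5).
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

# One column per distinct item, one row per transaction.
columns = sorted(set().union(*transactions))
matrix = [[int(item in t) for item in columns] for t in transactions]

print(columns)
for row in matrix:
    print(row)

# The mean of each 0/1 column is the support of that single item.
item_support = {
    col: sum(row[j] for row in matrix) / len(matrix)
    for j, col in enumerate(columns)
}
print(item_support)
```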

Understanding the Data Structure

In [18]:
basket_sets.columns
Out[18]:
Index(['10 colour spaceboy pen', '12 pencil small tube woodland',
       '12 pencils tall tube posy', '12 pencils tall tube red retrospot',
       '16 piece cutlery set pantry design', '20 dolly pegs retrospot',
       '3 hook hanger magic garden', '3 stripey mice feltcraft',
       '3 tier cake tin green and cream', '3 tier cake tin red and cream',
       ...
       'wrap doiley design', 'wrap dolly girl', 'wrap english rose',
       'wrap i love london', 'wrap poppies  design', 'wrap red apples',
       'wrap red vintage doily', 'wrap vintage leaf design',
       'wrap wedding day', 'yellow giant garden thermometer'],
      dtype='object', name='Description', length=609)
In [19]:
basket_sets.index
Out[19]:
Index(['536389', '537676', '539419', '540267', '540280', '540557', '540700',
       '541149', '541271', '541520', '541657', '542542', '543357', '543372',
       '543376', '543989', '545065', '545475', '546135', '547659', '548661',
       '549313', '552956', '553546', '554037', '554126', '556917', '556918',
       '558536', '558537', '559919', '559920', '560033', '560473', '560491',
       '561040', '561228', '563179', '563614', '565145', '565146', '565466',
       '567085', '568145', '568687', '568695', '568708', '569647', '569650',
       '569722', '569723', '574014', '574138', '574469', '576394', '576586',
       '578459', 'C538723', 'C543375', 'C545525', 'C548729', 'C551348',
       'C555046', 'C555288', 'C560540', 'C561227', 'C568694', 'C574019',
       'C574344'],
      dtype='object', name='InvoiceNo')
In [20]:
basket_sets.iloc[0]
Out[20]:
Description
10 colour spaceboy pen                0
12 pencil small tube woodland         0
12 pencils tall tube posy             0
12 pencils tall tube red retrospot    0
16 piece cutlery set pantry design    0
                                     ..
wrap red apples                       0
wrap red vintage doily                0
wrap vintage leaf design              0
wrap wedding day                      0
yellow giant garden thermometer       0
Name: 536389, Length: 609, dtype: int64
In [21]:
basket_sets.loc['553546'].sum()
Out[21]:
73
In [22]:
frequent_itemsets = apriori(basket_sets, min_support=0.07, use_colnames=True)
frequent_itemsets.sort_values(by='support', ascending=False, na_position='last', inplace = True)
frequent_itemsets
C:\anaconda\envs\Teaching\lib\site-packages\mlxtend\frequent_patterns\fpcommon.py:109: DeprecationWarning: DataFrames with non-bool types result in worse computationalperformance and their support might be discontinued in the future.Please use a DataFrame with bool type
  warnings.warn(
Out[22]:
     support   itemsets
33   0.130435  (set of 3 cake tins pantry design)
28   0.130435  (red toadstool led night light)
31   0.115942  (roses regency teacup and saucer)
15   0.115942  (lunch bag red retrospot)
4    0.115942  (baking set spaceboy design)
...  ...       ...
11   0.072464  (homemade jam scented candles)
7    0.072464  (circus parade lunch box)
6    0.072464  (blue happy birthday bunting)
5    0.072464  (black/blue polkadot umbrella)
61   0.072464  (regency cakestand 3 tier, spaceboy lunch box,...

62 rows × 2 columns

In [23]:
type(frequent_itemsets)
Out[23]:
pandas.core.frame.DataFrame
In [24]:
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1)
rules.sort_values(by='lift', ascending=False, na_position='last', inplace = True)
rules.head(5)
Out[24]:
antecedents consequents antecedent support consequent support support confidence lift leverage conviction zhangs_metric
73 (roses regency teacup and saucer, dolly girl l... (regency cakestand 3 tier, spaceboy lunch box) 0.072464 0.072464 0.072464 1.0 13.8 0.067213 inf 1.0
71 (spaceboy lunch box, roses regency teacup and ... (regency cakestand 3 tier, dolly girl lunch box) 0.072464 0.072464 0.072464 1.0 13.8 0.067213 inf 1.0
70 (regency cakestand 3 tier, dolly girl lunch box) (spaceboy lunch box, roses regency teacup and ... 0.072464 0.072464 0.072464 1.0 13.8 0.067213 inf 1.0
68 (regency cakestand 3 tier, spaceboy lunch box) (roses regency teacup and saucer, dolly girl l... 0.072464 0.072464 0.072464 1.0 13.8 0.067213 inf 1.0
0 (spaceboy lunch box) (dolly girl lunch box) 0.086957 0.086957 0.086957 1.0 11.5 0.079395 inf 1.0
In [25]:
type(rules)
Out[25]:
pandas.core.frame.DataFrame
In [26]:
rules.shape
Out[26]:
(78, 10)
In [27]:
# Filtering
rules[ (rules['lift'] >= 10) & (rules['confidence'] >= 0.9) ]
Out[27]:
antecedents consequents antecedent support consequent support support confidence lift leverage conviction zhangs_metric
73 (roses regency teacup and saucer, dolly girl l... (regency cakestand 3 tier, spaceboy lunch box) 0.072464 0.072464 0.072464 1.0 13.8 0.067213 inf 1.000000
71 (spaceboy lunch box, roses regency teacup and ... (regency cakestand 3 tier, dolly girl lunch box) 0.072464 0.072464 0.072464 1.0 13.8 0.067213 inf 1.000000
70 (regency cakestand 3 tier, dolly girl lunch box) (spaceboy lunch box, roses regency teacup and ... 0.072464 0.072464 0.072464 1.0 13.8 0.067213 inf 1.000000
68 (regency cakestand 3 tier, spaceboy lunch box) (roses regency teacup and saucer, dolly girl l... 0.072464 0.072464 0.072464 1.0 13.8 0.067213 inf 1.000000
0 (spaceboy lunch box) (dolly girl lunch box) 0.086957 0.086957 0.086957 1.0 11.5 0.079395 inf 1.000000
38 (circus parade lunch box, spaceboy lunch box) (dolly girl lunch box) 0.072464 0.086957 0.072464 1.0 11.5 0.066163 inf 0.984375
1 (dolly girl lunch box) (spaceboy lunch box) 0.086957 0.086957 0.086957 1.0 11.5 0.079395 inf 1.000000
41 (circus parade lunch box) (spaceboy lunch box, dolly girl lunch box) 0.072464 0.086957 0.072464 1.0 11.5 0.066163 inf 0.984375
44 (spaceboy lunch box, roses regency teacup and ... (dolly girl lunch box) 0.072464 0.086957 0.072464 1.0 11.5 0.066163 inf 0.984375
46 (roses regency teacup and saucer, dolly girl l... (spaceboy lunch box) 0.072464 0.086957 0.072464 1.0 11.5 0.066163 inf 0.984375
56 (circus parade lunch box) (dolly girl lunch box) 0.072464 0.086957 0.072464 1.0 11.5 0.066163 inf 0.984375
54 (circus parade lunch box) (spaceboy lunch box) 0.072464 0.086957 0.072464 1.0 11.5 0.066163 inf 0.984375
30 (spaceboy lunch box, roses regency teacup and ... (regency cakestand 3 tier) 0.072464 0.086957 0.072464 1.0 11.5 0.066163 inf 0.984375
64 (regency cakestand 3 tier, spaceboy lunch box,... (dolly girl lunch box) 0.072464 0.086957 0.072464 1.0 11.5 0.066163 inf 0.984375
66 (regency cakestand 3 tier, roses regency teacu... (spaceboy lunch box) 0.072464 0.086957 0.072464 1.0 11.5 0.066163 inf 0.984375
67 (spaceboy lunch box, roses regency teacup and ... (regency cakestand 3 tier) 0.072464 0.086957 0.072464 1.0 11.5 0.066163 inf 0.984375
69 (regency cakestand 3 tier, roses regency teacu... (spaceboy lunch box, dolly girl lunch box) 0.072464 0.086957 0.072464 1.0 11.5 0.066163 inf 0.984375
39 (circus parade lunch box, dolly girl lunch box) (spaceboy lunch box) 0.072464 0.086957 0.072464 1.0 11.5 0.066163 inf 0.984375
29 (regency cakestand 3 tier, roses regency teacu... (spaceboy lunch box) 0.072464 0.086957 0.072464 1.0 11.5 0.066163 inf 0.984375
2 (alarm clock bakelike red) (alarm clock bakelike green) 0.086957 0.086957 0.086957 1.0 11.5 0.079395 inf 1.000000
3 (alarm clock bakelike green) (alarm clock bakelike red) 0.086957 0.086957 0.086957 1.0 11.5 0.079395 inf 1.000000
10 (regency cakestand 3 tier, roses regency teacu... (dolly girl lunch box) 0.072464 0.086957 0.072464 1.0 11.5 0.066163 inf 0.984375
12 (roses regency teacup and saucer, dolly girl l... (regency cakestand 3 tier) 0.072464 0.086957 0.072464 1.0 11.5 0.066163 inf 0.984375
20 (regency cakestand 3 tier, spaceboy lunch box) (dolly girl lunch box) 0.072464 0.086957 0.072464 1.0 11.5 0.066163 inf 0.984375
21 (regency cakestand 3 tier, dolly girl lunch box) (spaceboy lunch box) 0.072464 0.086957 0.072464 1.0 11.5 0.066163 inf 0.984375

End of Module Association Rule